Index

  1. Introduction

  2. Image display (oro.dicom)

  3. Radiomic features (RIA)

  4. Segmentation

    4.1. Segmentation with lungct

    4.2. Segmentation with Python (I)

    4.3. Segmentation with Python (II)

    4.4. Segmentation with lungmask

  5. Radiomic features with masks


1. Introduction to radiomics

Medical images are essential for studying diseases. For example, to examine how a patient’s tumor evolves, we perform multiple scans over time. Interpreting how the tumor has changed by looking at the images is not an easy task, and different doctors may predict different outcomes. This is where computers become very useful. Thanks to artificial intelligence techniques, we can show computers different medical images over time of a patient with a tumor, and they can identify features of these images to make predictions. They can help decide whether a treatment for a tumor is working or not, decide which alternative treatment to use, and much more, without having to wait for the tumor to get worse.

In the last decades, several areas of human activities have experienced an increase in digitization. In the case of medicine, a significant amount of information generated during clinical routine has been digitized. With this digital increase, new and better software has been developed to analyze the data. Thanks to research in Artificial Intelligence, these methods have become very powerful and available to any user, allowing doctors to use them on a daily basis.

As the amount of data increases, different Artificial Intelligence techniques - mainly Machine Learning and Deep Learning - are of high utility to deal with this large amount of data, an area known as “Big Data”. In simple terms, Big Data refers to sets of data whose size, complexity and speed of growth make it difficult to analyze using traditional tools.

Radiomics is a field of medical study whose purpose consists of extracting a large volume of features from medical images using data characterization algorithms. These features are known as radiomic features and can help to discover tumor patterns that are difficult for the human eye to analyze. Radiomics is a relatively new scientific filed. The first radiomics studies appeared in PubMed as recently as 2011.

It is believed that, in the end, radiomics will use specific treatments for each patient, will help doctors select the most appropriate treatment for each patient, and will be able to change treatments quickly if they do not work.

Some of the difficulties when performing radiomics research is that high quality images, with adequate size and complete datasets are needed. Different training and validation datasets are also necessary, to check if our algorithm works correctly. Another difficulty is class imbalance (classification problem where the number of observations per class is not equally distributed) and overfitting (when a statistical model exactly fits its training data). The main clinical applications of radiomics are radiogenomics and clinical outcome prediction.

Process of radiomics

Now we will briefly explain the process of radiomics.

  1. Obtain image: first, we obtain a medical image from a scanner (it can be obtained from multiple modalities: magnetic resonance imaging (MRI), computed tomography (CT), positron-emission-tomography (PET)…).
  2. Image segmentation: this means dividing the image into multiple segments, in other words, delineating the areas of interest in the image in terms of pixels or voxels. This step can be done manually, semi-automatically or fully automatically.
  3. Feature extraction: after image segmentation, we can extract the features and classify them.
  4. Finally, we use databases to share our data.
DICOM images

In this project, we will study images obtained from scans of the lungs. The images we will obtain from the scans are DICOM images. DICOM (Digital Imaging and Communications in Medicine) is the standard for the communication and management of medical imaging information and related data. It is used worldwide to store, exchange, and transmit medical images. It defines the formats for medical images that can be exchanged with the data and quality necessary for clinical use. DICOM incorporates standards for imaging modalities such as radiography, ultrasonography, computed tomography (CT), magnetic resonance imaging (MRI), and radiation therapy. The most common applications of this standard are the display, storage, printing, and transmission of images.

Radiomic techniques

To obtain the features, we can use multiple techniques. Radiomic techniques can be divided into four categories: intensity-based metrics, texture-based analysis, shape-based measures, and transform-based metrics. We will briefly discuss these techniques now.

  • Intensity-based metrics refers to statistics that are calculated from pixel values (or volumetric pixels called voxels). Additional information that can be obtained from analyzing the relationship between the voxels it is not considered. To compute the statistics, we select a region and extract the voxel values. They can be analyzed with histogram analysis. To quantify different aspects of the distribution we use average and variation, shape, and diversity.

  • Texture-based analysis refers to the analysis of image patterns, such as intensity, shape, or colors. Mathematical formulas based on the spatial relationship of voxels are used to quantify these concepts.

  • Shape-based measures refer to the study of different components of a structure. They can be divided into 1D metrics (measurement of the distance between two points. They are used to describe the magnitude of an abnormality), 2D metrics (calculated on cross-sectional planes and are used to calculate different parameters that are based on areas) and 3D metrics (attempt to enumerate different aspects of volumetric shape).

  • Transform-based metrics refers to the transformation of images from spatial coordinates to what is called a frequency domain, without losing any information.

Radiomic features

We can obtain multiple types of features from images. Qualitative features are used to describe lesions, and quantitative features are extracted from images using computer programs that apply mathematical algorithms. Now, we will focus on quantitative features, which can be divided into different subgroups.

  • Shape features describe the shape of the traced region of interest and its geometric properties like volume, maximum diameter along different orthogonal directions, maximum surface, tumor compactness, etc.

  • First-order statistics features describe the distribution of individual voxel values without taking into account spatial relationships. Some properties we obtain are the mean, median, maximum, minimum values of the voxel intensities on the image, skewness, kurtosis, etc.

  • Second-order statistics features include the textural features. They are obtained computing the statistical relationships between neighboring voxels.

  • Higher-order statistics features are obtained by statistical methods after applying filters or mathematical transforms to the images. Some examples are fractal analysis, Minkowski functionals or Laplacian transforms of Gaussian-filtered images.


2. Image display (oro.dicom)

In this section we will display a few images using R.

The R package oro.dicom is a collection of input/output functions for medical imaging data that conform to the Digital Imaging and Communications in Medicine (DICOM) standard. The R package RIA is another package that was developed to facilitate radiomic analysis of medical images.

First we load the libraries oro.dicom and RIA.

library(oro.dicom)
library(RIA)

Now we will display 3 images from the same patient (ID00007637202177411956430). To read the images we use the function readDICOMFile(path_of_the_image).

image1 <- readDICOMFile("/Users/andrealetaalfonso/Desktop/TFG/images/Kaggle/ID00007637202177411956430/11.dcm")
image(t(image1$img), col=grey(0:64/64), axes=FALSE, xlab="", ylab="")

image2 <- readDICOMFile("/Users/andrealetaalfonso/Desktop/TFG/images/Kaggle/ID00007637202177411956430/12.dcm")
image(t(image2$img), col=grey(0:64/64), axes=FALSE, xlab="", ylab="")

image3 <- readDICOMFile("/Users/andrealetaalfonso/Desktop/TFG/images/Kaggle/ID00007637202177411956430/13.dcm")
image(t(image3$img), col=grey(0:64/64), axes=FALSE, xlab="", ylab="")


3. Radiomic features (RIA)

Radiomic features without masks (RIA)

First we will compute the radiomic features of the images without masks.

To compute the radiomic features of the images using the library RIA, we first need to convert the DICOM images to RIA_image class. To do so, we will use the load_dicom(path_of_the_file) function.

If we have other image formats, there exist functions such as load_nifti(), load_nrrd() and load_npy() in RIA to read the images.

1 image

First, we compute the radiomic features of 1 image without a mask (we will do it for the first image displayed earlier). To compute the first-order statistics we use the first_order() function.

Some of the metrics computed are the mean, median, energy, entropy, and so on.

DICOM <- RIA::load_dicom(filename="/Users/andrealetaalfonso/Desktop/TFG/images/img/Folder_images/13", skipFirst128=FALSE, DICM=FALSE, boffset = 128)
DICOM = first_order(RIA_data_in = DICOM) # first order statistics
first_order <- RIA:::list_to_df(DICOM$stat_fo$orig) # list of first order statistics
name_characteristics <- rownames(first_order) # names of the first order statistics
library(rmarkdown)
paged_table(first_order)
10 images

Now we will compute the first order characteristics of 10 images (images 10 through 19) from the same patient (ID00007637202177411956430) without a mask.

first_order <- c()
res <- c()
first_image <- 10 # first image to study
last_image <- 19 # last image to study
for (i in first_image:last_image) { # for loop through all the images
  path <- file.path("/Users", "andrealetaalfonso", "Desktop","TFG", "images", "img", "Folder_images", i, fsep="/") # path of every image
  DICOM <- RIA::load_dicom(filename=path, skipFirst128=FALSE, DICM=FALSE, boffset = 128) # load dicom image
  DICOM = first_order(RIA_data_in = DICOM, use_type = "single") # compute first order characteristics
  first_order[i] <- RIA:::list_to_df(DICOM$stat_fo$orig)
  res[i-first_image+1] <- first_order[i] # data.frame counts from 1 to 10 (not first_image to last_image)
} 

Finally, we combine all the results into a data.frame called ‘results’.

library(rmarkdown)
results <- do.call(cbind.data.frame, res)
colnames(results) <- seq(1, last_image-first_image+1) 
rownames(results) <- name_characteristics
#results:  results in a data.frame
paged_table(results, options = list(rows.print = 10, cols.print = 5))


4. Segmentation

As we explained in the introduction, segmentation refers to delineating the areas of interest in the image in terms of pixels or voxels, in our case, the lung. Lung segmentation refers to the computer-based process of identifying the boundaries of lungs from surrounding thoracic tissue on the scan images. Once we have the segmented images, we can start our analysis.

Now we will discuss different software to do lung segmentation.


4.1 Segmentation with lungct

We will use the library in R lungct. The lungct R package develops an image processing pipeline for computed tomography (CT) scans of the lungs.

First, we load the libraries needed to do the segmentation. We will use the library dcm2niir to convert DCM files to NIFTI.

library(lungct) 
library(dcm2niir) 
library(ANTsRCore)

Now we have to convert the data from DCM to NII. NIFTI (Neuroimaging Informatics Technology Initiative) is a data format for the storage of Functional Magnetic Resonance Imaging (fMRI) and other medical images.

res <- dcm2niir::dcm2nii(basedir="/Users/andrealetaalfonso/Desktop/TFG/images/Kaggle/ID00184637202242062969203")
checked <- check_dcm2nii(res) # Manipulating the output

Next we plot some images.

image <- antsImageRead(checked)
plot(image)

We create the masks.

mask <- segment_lung(image)

And finally, we plot some of the masks.

plot(mask$lung_mask)

Now, we will do the same for another set of images.

res2 <- dcm2niir::dcm2nii(basedir="/Users/andrealetaalfonso/Desktop/TFG/images/eclipse/id_1")
checked2 <- check_dcm2nii(res2) # Manipulating the output
image2 <- antsImageRead(checked2)
plot(image2)

mask2 <- segment_lung(image2)
plot(mask2$lung_mask)

Radiomic features with masks (lungct)

In this section we will compute the radiomic features of the images with masks, obtained with lungct.

RIA_lung(
  image, mask$lung_mask, 
  sides = c("right", "left"), 
  features = c("fo"), 
  bins_in = 16, equal_prob = FALSE, distance = 1, 
  statistic = "mean(X, na.rm = TRUE)")

ERROR, IT DOES NOT WORK

Radiomic features with masks with multiple patients (RIA, lungct)

Now we will create multiple .txt files with the results of the radiomic features of different patients, using a mask.

The IDs of the patients are the following.

patient1 <- "ID00007637202177411956430"
patient2 <- "ID00009637202177434476278"
patient3 <- "ID00010637202177584971671"
patient4 <- "ID00011637202177653955184"
patient5 <- "ID00012637202177665765362"
patient6 <- "ID00014637202177757139317"
patient7 <- "ID00015637202177877247924"
patient8 <- "ID00019637202178323708467"
patient9 <- "ID00020637202178344345685"
patient10 <- "ID00023637202179104603099"
vector_patients <- c(patient1, patient2, patient3, patient4, patient5, patient6, patient7, patient8, patient9, patient10) # list of patients ID
for(patient in vector_patients){
  first_order <- c()
  path <- file.path("/Users", "andrealetaalfonso", "Desktop","TFG", "images", "Kaggle", patient, fsep="/") # path of every image
  number_files <- length(list.files(path)) # number of folders (ie. images) in each patient
  for (i in 1:number_files) { # for loop through all the images
    path <- file.path("/Users", "andrealetaalfonso", "Desktop","TFG", "images", "Kaggle", patient, i, fsep="/") # path of every image
    res <- dcm2niir::dcm2nii(basedir=path) # Convert the data from dcm to nii
    checked <- check_dcm2nii(res) # Manipulating the output
    image <- antsImageRead(checked)
    mask <- segment_lung(image) # Lung segmentation
    DICOM <- RIA::load_dicom(filename=path, mask_filename=path, skipFirst128=FALSE, DICM=FALSE, boffset = 128) # load dicom image + mask
    DICOM = first_order(RIA_data_in = DICOM, use_type = "single") # compute first order characteristics
    first_order[i] <- RIA:::list_to_df(DICOM$stat_fo$orig)
  } 
  results <- do.call(cbind.data.frame, first_order)
  colnames(results) <- seq(1, number_files)
  rownames(results) <- name_characteristics
  write.table(results, row.names = TRUE, col.names=NA, file=paste0(patient,".txt"), sep="\t") # table with the results
}


4.2. Segmentation with Python (I)

Code adapted from: https://www.kaggle.com/arnavkj95/candidate-generation-and-luna16-preprocessing

Now we will do the segmentation but instead of using R, we will be using Python.

First we need to import some Python libraries.

import numpy as np # pip3 install numpy
import pandas as pd # pip3 install pandas
# pip3 install matplotlib
# pip3 install scipy
import skimage # pip3 install scikit-image
import os 
from skimage.morphology import ball, disk, dilation, binary_erosion, remove_small_objects, erosion, closing, reconstruction, binary_closing
from skimage.measure import label,regionprops, perimeter
from skimage.morphology import binary_dilation, binary_opening
from skimage.filters import roberts, sobel
from skimage import measure, feature
from skimage.segmentation import clear_border
from skimage import data
from scipy import ndimage as ndi
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d.art3d import Poly3DCollection
import dicom # pip3 install dicom
import scipy.misc
import pydicom # pip3 install pydicom
import matplotlib.pyplot as plt

Each scan consist in multiple slices. We have all the DICOM images from the scan in one folder. In path_images we indicate the path of the folder.

from subprocess import check_output
path_images = "/Users/andrealetaalfonso/Desktop/TFG/images/Kaggle/ID00184637202242062969203/"
# You can check that everything is loading correctly with: print(check_output(["ls", path_images]).decode("utf8"))

To read the images, we will use the function pydicom.read_file(). Then we will update the intensity values of -2000 with 0. These pixels are the ones that are located outside the scanner bounds.

# pip3 install nltk==3.6.2
lung = pydicom.read_file("/Users/andrealetaalfonso/Desktop/TFG/images/Kaggle/ID00184637202242062969203/20.dcm")
slice = lung.pixel_array
plt.axis('off')
slice[slice == -2000] = 0
plt.imshow(slice, cmap=plt.cm.gray)
plt.show()

We create a function file_is_hidden() to read only .dcm files and not hidden files in the folder that we cannot see in our computers.

if os.name == 'nt':
    import win32api, win32con
def file_is_hidden(p):
    if os.name== 'nt':
        attribute = win32api.GetFileAttributes(p)
        return attribute & (win32con.FILE_ATTRIBUTE_HIDDEN | win32con.FILE_ATTRIBUTE_SYSTEM)
    else:
        return p.startswith('.') #linux-osx
file_list = [f for f in os.listdir(path_images) if not file_is_hidden(f)] 

Now we will read all the images from a folder with a function named read_ct_scan(folder_name).

def read_ct_scan(folder_name):
  # Read the slices from the dicom file
  slices = [pydicom.read_file(folder_name + filename) for filename in os.listdir(folder_name) if not file_is_hidden(filename)]
  # Sort the dicom slices in their respective order
  slices.sort(key=lambda x: int(x.InstanceNumber))
  # Get the pixel values for all the slices
  slices = np.stack([s.pixel_array for s in slices])
  slices[slices == -2000] = 0
  return slices
  
ct_scan = read_ct_scan(path_images) 

Plot some of the images from a folder.

def plot_ct_scan(scan):
    f, plots = plt.subplots(int(scan.shape[0] / 20) + 1, 4, figsize=(25, 25))
    for i in range(0, scan.shape[0], 5):
        plots[int(i / 20), int((i % 20) / 5)].axis('off')
        plots[int(i / 20), int((i % 20) / 5)].imshow(scan[i], cmap=plt.cm.gray) 

plot_ct_scan(ct_scan)
plt.show()

Now we will do the lung segmentation of 1 image.

def get_segmented_lungs2(im, plot=False):
    if plot == True:
        f, plots = plt.subplots()
    # Step 1: Convert into a binary image. 
    binary = im < 604
    # Step 2: Remove the blobs connected to the border of the image.
    cleared = clear_border(binary)
    # Step 3: Label the image.
    label_image = label(cleared)
    # Step 4: Keep the labels with 2 largest areas.
    areas = [r.area for r in regionprops(label_image)]
    areas.sort()
    if len(areas) > 2:
        for region in regionprops(label_image):
            if region.area < areas[-2]:
                for coordinates in region.coords:                
                       label_image[coordinates[0], coordinates[1]] = 0
    binary = label_image > 0
    # Step 5: Erosion operation with a disk of radius 2. This operation is seperate the lung nodules attached to the blood vessels.
    selem = disk(2)
    binary = binary_erosion(binary, selem)
    # Step 6: Closure operation with a disk of radius 10. This operation is to keep nodules attached to the lung wall.
    selem = disk(10)
    binary = binary_closing(binary, selem)
    # Step 7: Fill in the small holes inside the binary mask of lungs.
    edges = roberts(binary)
    binary = ndi.binary_fill_holes(edges)
    if plot == True:
        plots.axis('off')
        plots.imshow(binary, cmap=plt.cm.bone) 
    # Step 8: Superimpose the binary mask on the input image.
    get_high_vals = binary == 0
    im[get_high_vals] = 0
    return im

Image

Mask of the image

get_segmented_lungs2(ct_scan[20], True)
plt.show()

The steps to obtain the segmentation are the following.

Now we will briefly explain the process of radiomics.

  1. First we convert de image into a binary image. This means that every pixel of the image has only one of two possible values. One black and one white.
  2. We remove the spots connected to the edge of the image.
  3. Create a label.
  4. Keep the labels with two largest areas.
  5. Seperate the lung nodules attached to the blood vessels.
  6. Keep nodules attached to the lung wall.
  7. Fill in the small holes inside the binary mask of lungs.
  8. Superimpose the binary mask on the input image.

We can do the segmentation of all the slices of the scan.

def segment_lung_from_ct_scan(ct_scan):
    return np.asarray([get_segmented_lungs2(slice) for slice in ct_scan])
segmented_ct_scan2 = segment_lung_from_ct_scan(ct_scan)
plot_ct_scan(segmented_ct_scan2)
plt.show()


4.3. Segmentation with Python (II)

Code adapted from: https://www.raddq.com/dicom-processing-segmentation-visualization-in-python/

We will study another way to perform segmentation of lung scans using Python.

As before, first we need to import some Python libraries.

import numpy as np
import dicom
import os
import matplotlib.pyplot as plt
from glob import glob
from mpl_toolkits.mplot3d.art3d import Poly3DCollection
import scipy.ndimage
from skimage import morphology
from skimage import measure
from skimage.transform import resize
# pip3 install scikit-learn
from sklearn.cluster import KMeans
# pip3 install plotly
from plotly import __version__
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
from plotly.tools import FigureFactory as FF
from plotly.graph_objs import *
import pydicom

Specify the path where the folder with all the images is, and the path where we want our mask images saved.

data_path = "/Users/andrealetaalfonso/Desktop/TFG/images/Kaggle/ID00184637202242062969203/"
output_path = working_path = "/Users/andrealetaalfonso/Desktop/TFG/"

Read all the DICOM files. To do so, we use glob, which is used to return all file paths that match a specific pattern.

g = glob(data_path + '/*.dcm')

Print out the first 5 file names to verify we are in the right folder.

print("Total of %d DICOM images.\nFirst 5 filenames:" % len(g))
## Total of 62 DICOM images.
## First 5 filenames:
print('\n'.join(g[:5]))
## /Users/andrealetaalfonso/Desktop/TFG/images/Kaggle/ID00184637202242062969203/16.dcm
## /Users/andrealetaalfonso/Desktop/TFG/images/Kaggle/ID00184637202242062969203/17.dcm
## /Users/andrealetaalfonso/Desktop/TFG/images/Kaggle/ID00184637202242062969203/15.dcm
## /Users/andrealetaalfonso/Desktop/TFG/images/Kaggle/ID00184637202242062969203/29.dcm
## /Users/andrealetaalfonso/Desktop/TFG/images/Kaggle/ID00184637202242062969203/28.dcm

Now we loop over the image files and store everything into a list.

def load_scan(path):
    slices = [pydicom.read_file(path + '/' + s) for s in os.listdir(path) if not file_is_hidden(s)]
    slices.sort(key = lambda x: int(x.InstanceNumber))
    try:
        slice_thickness = np.abs(slices[0].ImagePositionPatient[2] - slices[1].ImagePositionPatient[2])
    except:
        slice_thickness = np.abs(slices[0].SliceLocation - slices[1].SliceLocation)
    for s in slices:
        s.SliceThickness = slice_thickness
    return slices

The voxel values in the images are raw. This function converts raw values into Houndsfeld units (a quantitative scale for describing radiodensity).

def get_pixels_hu(scans):
    image = np.stack([s.pixel_array for s in scans])
    # Convert to int16 (from sometimes int16), should be possible as values should always be low enough (<32k)
    image = image.astype(np.int16)
    # Set outside-of-scan pixels to 1. The intercept is usually -1024, so air is approximately 0
    image[image == -2000] = 0
    # Convert to Hounsfield units (HU)
    intercept = scans[0].RescaleIntercept
    slope = scans[0].RescaleSlope
    if slope != 1:
        image = slope * image.astype(np.float64)
        image = image.astype(np.int16)
    image += np.int16(intercept)
    return np.array(image, dtype=np.int16)
id = 0
patient = load_scan(data_path) 
imgs = get_pixels_hu(patient) 

Save in .npy format

np.save(output_path + "fullimages_%d.npy" % (id), imgs)

Now we will check if the Houndsfeld Units (HU) are properly scaled and represented. HU are very important because they are standardized across all CT scans. To give a few examples, air is −1000, lung −500, fat −100 to −50, water 0, blood +30 to +70, muscle +10 to +40 and so on.

Let’s display an histogram to check HU.

file_used=output_path+"fullimages_%d.npy" % id
imgs_to_process = np.load(file_used).astype(np.float64) 
plt.hist(imgs_to_process.flatten(), bins=50, color='c')
plt.xlabel("Hounsfield Units (HU)")
plt.ylabel("Frequency")
plt.show()

With the histogram, we can see that there is a lot of air and water. An abundance of lung, fat, bood and muscle.

We obtain the conclusions that we will need to do preprocessing to study well the lungs only, because there is a lot of air and water.

Now we display some of the images, we will be skipping every two slices.

id = 0
imgs_to_process = np.load(output_path+'fullimages_{}.npy'.format(id))

def sample_stack(stack, rows=4, cols=4, start_with=5, show_every=2):
    fig,ax = plt.subplots(rows,cols,figsize=[12,12])
    for i in range(rows*cols):
        ind = start_with + i*show_every
        ax[int(i/rows),int(i % rows)].set_title('slice %d' % ind)
        ax[int(i/rows),int(i % rows)].imshow(stack[ind],cmap='gray')
        ax[int(i/rows),int(i % rows)].axis('off')
    plt.show()

sample_stack(imgs_to_process)

We can see a lot of gray that represents air.

Let’s see how thick each slice is.

print("Slice Thickness: %f" % patient[0].SliceThickness)
## Slice Thickness: 5.000000
print("Pixel Spacing (row, col): (%f, %f) " % (patient[0].PixelSpacing[0], patient[0].PixelSpacing[1]))
## Pixel Spacing (row, col): (0.722656, 0.722656)

Now we do resampling. We want that each slice is resampled in 1x1x1 mm pixels and slices, because it will be useful to obtain the 3d image.

id = 0
imgs_to_process = np.load(output_path+'fullimages_{}.npy'.format(id))
def resample(image, scan, new_spacing=[1,1,1]):
    # Determine current pixel spacing
    spacing = map(float, ([scan[0].SliceThickness] + list(scan[0].PixelSpacing)))
    spacing = np.array(list(spacing))
    resize_factor = spacing / new_spacing
    new_real_shape = image.shape * resize_factor
    new_shape = np.round(new_real_shape)
    real_resize_factor = new_shape / image.shape
    new_spacing = spacing / real_resize_factor
    image = scipy.ndimage.interpolation.zoom(image, real_resize_factor)
    return image, new_spacing
print ("Shape before resampling\t", imgs_to_process.shape)
## Shape before resampling   (62, 512, 512)
imgs_after_resamp, spacing = resample(imgs_to_process, patient, [1,1,1])
## <string>:10: DeprecationWarning:
## 
## Please use `zoom` from the `scipy.ndimage` namespace, the `scipy.ndimage.interpolation` namespace is deprecated.
print ("Shape after resampling\t", imgs_after_resamp.shape)
## Shape after resampling    (310, 370, 370)

3D Plotting

def make_mesh(image, threshold=-300, step_size=1):
    p = image.transpose(2,1,0)
    verts, faces, norm, val = measure.marching_cubes(p, threshold, step_size=step_size, allow_degenerate=True) 
    return verts, faces
def plt_3d(verts, faces):
    x,y,z = zip(*verts) 
    fig = plt.figure(figsize=(10, 10))
    ax = fig.add_subplot(111, projection='3d')
    # Fancy indexing: `verts[faces]` to generate a collection of triangles
    mesh = Poly3DCollection(verts[faces], linewidths=0.05, alpha=1)
    face_color = [1, 1, 0.9]
    mesh.set_facecolor(face_color)
    ax.add_collection3d(mesh)
    ax.set_xlim(0, max(x))
    ax.set_ylim(0, max(y))
    ax.set_zlim(0, max(z))
    ax.set_facecolor((0.7, 0.7, 0.7)) # canvio depricated version!
    plt.show()
v, f = make_mesh(imgs_after_resamp, 350)
plt_3d(v, f)

Segmentation

Now we will do the segmentation of the lungs.

# Standardize the pixel values
def make_lungmask(img, display=False):
    row_size= img.shape[0]
    col_size = img.shape[1]
    mean = np.mean(img)
    std = np.std(img)
    img = img-mean
    img = img/std
    # Find the average pixel value near the lungs to renormalize washed out images
    middle = img[int(col_size/5):int(col_size/5*4),int(row_size/5):int(row_size/5*4)] 
    mean = np.mean(middle)  
    max = np.max(img)
    min = np.min(img)
    # To improve threshold finding, I'm moving the underflow and overflow on the pixel spectrum
    img[img==max]=mean
    img[img==min]=mean
    # Using Kmeans to separate foreground (soft tissue / bone) and background (lung/air)
    kmeans = KMeans(n_clusters=2).fit(np.reshape(middle,[np.prod(middle.shape),1]))
    centers = sorted(kmeans.cluster_centers_.flatten())
    threshold = np.mean(centers)
    thresh_img = np.where(img<threshold,1.0,0.0)  # threshold the image
    # First erode away the finer elements, then dilate to include some of the pixels surrounding the lung.  
    # We don't want to accidentally clip the lung.
    eroded = morphology.erosion(thresh_img,np.ones([3,3]))
    dilation = morphology.dilation(eroded,np.ones([8,8]))
    labels = measure.label(dilation) # Different labels are displayed in different colors
    label_vals = np.unique(labels)
    regions = measure.regionprops(labels)
    good_labels = []
    for prop in regions:
        B = prop.bbox
        if B[2]-B[0]<row_size/10*9 and B[3]-B[1]<col_size/10*9 and B[0]>row_size/5 and B[2]<col_size/5*4:
            good_labels.append(prop.label)
    mask = np.ndarray([row_size,col_size],dtype=np.int8)
    mask[:] = 0
    # After just the lungs are left, we do another large dilation in order to fill in and out the lung mask 
    for N in good_labels:
        mask = mask + np.where(labels==N,1,0)
    mask = morphology.dilation(mask,np.ones([10,10])) # one last dilation
    if (display):
        fig, ax = plt.subplots(3, 2, figsize=[12, 12])
        ax[0, 0].set_title("Original")
        ax[0, 0].imshow(img, cmap='gray')
        ax[0, 0].axis('off')
        ax[0, 1].set_title("Threshold")
        ax[0, 1].imshow(thresh_img, cmap='gray')
        ax[0, 1].axis('off')
        ax[1, 0].set_title("After Erosion and Dilation")
        ax[1, 0].imshow(dilation, cmap='gray')
        ax[1, 0].axis('off')
        ax[1, 1].set_title("Color Labels")
        ax[1, 1].imshow(labels)
        ax[1, 1].axis('off')
        ax[2, 0].set_title("Final Mask")
        ax[2, 0].imshow(mask, cmap='gray')
        ax[2, 0].axis('off')
        ax[2, 1].set_title("Apply Mask on Original")
        ax[2, 1].imshow(mask*img, cmap='gray')
        ax[2, 1].axis('off')
        plt.show()
    return mask*img

We know obtain the segmentation to only one image.

img = imgs_after_resamp[63]
make_lungmask(img, display=True)

Anf finally we apply it to all the slices.

masked_lung = []
for img in imgs_after_resamp:
    masked_lung.append(make_lungmask(img))
sample_stack(masked_lung, show_every=10)

Save masks in .npy format.

np.save(output_path + "maskedimages_%d.npy" % (id), imgs)


4.4. Segmentation with lungmask

And finally, we can also do segmentation with lungmask.

First we import the necessary Python libraries.

from lungmask import mask
import SimpleITK as sitk
import os
import pydicom
from pydicom.data import get_testdata_files
import numpy as np
import matplotlib.pyplot as plt
import cv2
import time

Function to read one image from a specified path.

def get_img_sitk(path):
    return sitk.ReadImage(path)

Function to read multiple images from a specified path.

def get_series_sitk(path):
    reader = sitk.ImageSeriesReader()
    reader.MetaDataDictionaryArrayUpdateOn()
    reader.LoadPrivateTagsOn()
    dicom_names = reader.GetGDCMSeriesFileNames(path)
    reader.SetFileNames(dicom_names)
    return reader.Execute(), reader

Function to generate a mask.

def generate_mask(img):
  segmentation = mask.apply(img)
  segmentation[segmentation > 0] = 1
  if img.GetSize()[2] > 1:
    masked_img = np.zeros((img.GetSize()[2], img.GetSize()[0], img.GetSize()[1]))
    for i in range(1, segmentation.__len__()):
      masked_img[i,:,:] = np.where(segmentation[i,:,:] == 1, sitk.GetArrayFromImage(img)[i,:,:], 0)
  else:
    masked_img = np.where(segmentation == 1, sitk.GetArrayFromImage(img)[0,:,:], 0)
    masked_img = masked_img[0,:,:]
    segmentation = segmentation[0,:,:]
  return segmentation, masked_img

To save the images.

def create_writer(series_reader):
    writer = sitk.ImageFileWriter()
    writer.KeepOriginalImageUIDOn()
    tags_to_copy = ["0010|0010",  # Patient Name
                    "0010|0020",  # Patient ID
                    "0010|0030",  # Patient Birth Date
                    "0020|000D",  # Study Instance UID, for machine consumption
                    "0020|0010",  # Study ID, for human consumption
                    "0008|0020",  # Study Date
                    "0008|0030",  # Study Time
                    "0008|0050",  # Accession Number
                    "0008|0060"  # Modality
                    ]
    modification_time = time.strftime("%H%M%S")
    modification_date = time.strftime("%Y%m%d")
    
    series_tag_values = [
                        (k, series_reader.GetMetaData(0, k))
                        for k in tags_to_copy
                        if series_reader.HasMetaDataKey(0, k)] + \
                    [("0008|0031", modification_time),  # Series Time
                     ("0008|0021", modification_date),  # Series Date
                     ("0008|0008", "DERIVED\\SECONDARY"),  # Image Type
                     ("0020|000e", "1.2.826.0.1.3680043.2.1125." +
                      modification_date + ".1" + modification_time),
                     ("0008|103e","Processed-SimpleITK")]  # Series Description
    return writer, series_tag_values

To obtain and save the masks in DICOM format.

def get_and_save_masks(series_path):
    # Get series of dicom images
    series, series_reader = get_series_sitk(series_path)
    
    # Get masked lungs and mask filters
    mask_filter, masked_lung = generate_mask(series)
    
    # Save masks
    writer, series_tag_values = create_writer(series_reader)
    
    for id in range(len(mask_filter)):
        image_slice = sitk.GetImageFromArray(mask_filter[id,:,:])
        # Tags shared by the series.
        for tag, value in series_tag_values:
            image_slice.SetMetaData(tag, value)
        # Slice specific tags.
        #   Instance Creation Date
        image_slice.SetMetaData("0008|0012", time.strftime("%Y%m%d"))
        #   Instance Creation Time
        image_slice.SetMetaData("0008|0013", time.strftime("%H%M%S"))
        #   Instace Number
        image_slice.SetMetaData("0020|0013", str(id))
        # Check if new directory exists
        if not os.path.exists(series_path + "_mask/"):
          os.makedirs(series_path + "_mask/")
        # Write to the output directory and add the extension dcm, to force writing
        # in DICOM format.
        writer.SetFileName(series_path + "_mask/" + str(id) + ".dcm")
        writer.Execute(image_slice)

Plot one image.

ds = pydicom.dcmread("/Users/andrealetaalfonso/Desktop/TFG/DICOMS/id_1/std_38.dcm")
plt.imshow(ds.pixel_array, cmap='gray')
plt.show()

Now we create a mask for the image above.

image1 <- get_img_sitk("/Users/andrealetaalfonso/Desktop/TFG/DICOMS/id_1/std_38.dcm")
mask_filter_single, masked_lung_single = generate_mask(image1)
plt.imshow(mask_filter_single, cmap='gray')
plt.show()
plt.imshow(masked_lung_single, cmap='gray')
plt.show()

We can visualize it all in one panel.

fig, ax = plt.subplots(1,3)
ax[0].set_title("Original", fontsize='small')
ax[0].imshow(ds.pixel_array, cmap='gray')
ax[0].axis('off')

ax[1].set_title("Segmented", fontsize='small')
ax[1].imshow(mask_filter_single, cmap='gray')
ax[1].axis('off')

ax[2].set_title("Masked", fontsize='small')
ax[2].imshow(masked_lung_single, cmap='gray')
ax[2].axis('off')

plt.show()

Now, we can do the same for all of the images in the folder.

We generate the masks and save them in the path + _mask.

get_and_save_masks("/Users/andrealetaalfonso/Desktop/TFG/DICOMS/id_1")


5. Radiomic features with masks

Now that we have in one folder all the images, and in another one all the masks, we can start computing the radiomic features of the images with the masks.

We will be using lungmask.

Load the images and the masks with the funcion load_dicom().

library(RIA)
images = load_dicom(filename = "/Users/andrealetaalfonso/Desktop/TFG/DICOMS/id_1", 
                    mask_filename = "/Users/andrealetaalfonso/Desktop/TFG/DICOMS/id_1_mask")
## Warning in oro.dicom::create3D(dcmImages_mask, mode = mode_in, transpose =
## transpose_in, : ImagePositionPatient is moving in more than one dimension.

Compute the radiomic features of the images.

images = first_order(RIA_data_in = images)
results <- RIA:::list_to_df(images$stat_fo$orig)
name_characteristics <- rownames(results)
library(rmarkdown)
paged_table(results)

Multiple patients at the same time

Now we will do the same but for 5 patients (in our case) at the same time.

First obtain and save the mask of the images.

# List of all the patient folders
patients = [f for f in os.listdir("/Users/andrealetaalfonso/Desktop/TFG/images/Kaggle") if not f.startswith('.')]

for i in patients: # for loop through all the patients
    path = os.path.join("/Users", "andrealetaalfonso", "Desktop","TFG", "images", "Kaggle", i)
    get_and_save_masks(path) # obtain and save masks

Now we compute the radiomic features.

library(RIA)
library(rmarkdown)

res <- c()

images_folder_names <- grep(list.files(path="/Users/andrealetaalfonso/Desktop/TFG/images/Kaggle"), pattern='mask', invert=TRUE, value=TRUE)
mask_folder_names <- list.files("/Users/andrealetaalfonso/Desktop/TFG/images/Kaggle", pattern="mask")

for(patient in 1:length(images_folder_names)){
  images_folder_path <- file.path("/Users", "andrealetaalfonso", "Desktop","TFG", "images", "Kaggle", images_folder_names[patient], fsep="/")
  mask_folder_path <- file.path("/Users", "andrealetaalfonso", "Desktop","TFG", "images", "Kaggle", mask_folder_names[patient], fsep="/")
  
  images = load_dicom(filename = images_folder_path, mask_filename = mask_folder_path)
  images = first_order(RIA_data_in = images)
  res[patient] <- RIA:::list_to_df(images$stat_fo$orig)
}
## Warning in oro.dicom::create3D(dcmImages_mask, mode = mode_in, transpose =
## transpose_in, : ImagePositionPatient is moving in more than one dimension.

## Warning in oro.dicom::create3D(dcmImages_mask, mode = mode_in, transpose =
## transpose_in, : ImagePositionPatient is moving in more than one dimension.

## Warning in oro.dicom::create3D(dcmImages_mask, mode = mode_in, transpose =
## transpose_in, : ImagePositionPatient is moving in more than one dimension.

## Warning in oro.dicom::create3D(dcmImages_mask, mode = mode_in, transpose =
## transpose_in, : ImagePositionPatient is moving in more than one dimension.

## Warning in oro.dicom::create3D(dcmImages_mask, mode = mode_in, transpose =
## transpose_in, : ImagePositionPatient is moving in more than one dimension.
results <- do.call(cbind.data.frame, res)
colnames(results) <- seq(1, length(images_folder_names))
rownames(results) <- name_characteristics
paged_table(results)
results <- do.call(cbind.data.frame, res)
colnames(results) <- seq(1, length(images_folder_names))
rownames(results) <- name_characteristics
paged_table(results)


Information about the patients

We have a .csv file with information about the patients.

Let’s open the file and view some of the rows.

data_patients <- read.csv('/Users/andrealetaalfonso/Desktop/TFG/images/train.csv')
data_patients <- as.data.frame(data_patients)
head(data_patients)
##                     Patient Weeks  FVC  Percent Age  Sex SmokingStatus
## 1 ID00007637202177411956430    -4 2315 58.25365  79 Male     Ex-smoker
## 2 ID00007637202177411956430     5 2214 55.71213  79 Male     Ex-smoker
## 3 ID00007637202177411956430     7 2061 51.86210  79 Male     Ex-smoker
## 4 ID00007637202177411956430     9 2144 53.95068  79 Male     Ex-smoker
## 5 ID00007637202177411956430    11 2069 52.06341  79 Male     Ex-smoker
## 6 ID00007637202177411956430    17 2101 52.86865  79 Male     Ex-smoker

The data we are provided with means the following:

  • Patient: a unique Id for each patient (also the name of the patient’s DICOM folder)

  • Weeks: the relative number of weeks pre/post the baseline CT (may be negative)

  • FVC: the recorded lung capacity in ml

  • Percent: a computed field which approximates the patient’s FVC as a percent of the typical FVC for a person of similar characteristics

  • Age

  • Sex

  • SmokingStatus

We will only select the information about the patients that we have.

data_patients[data_patients$Patient %in% images_folder_names,]
##                       Patient Weeks  FVC   Percent Age    Sex SmokingStatus
## 1   ID00007637202177411956430    -4 2315  58.25365  79   Male     Ex-smoker
## 2   ID00007637202177411956430     5 2214  55.71213  79   Male     Ex-smoker
## 3   ID00007637202177411956430     7 2061  51.86210  79   Male     Ex-smoker
## 4   ID00007637202177411956430     9 2144  53.95068  79   Male     Ex-smoker
## 5   ID00007637202177411956430    11 2069  52.06341  79   Male     Ex-smoker
## 6   ID00007637202177411956430    17 2101  52.86865  79   Male     Ex-smoker
## 7   ID00007637202177411956430    29 2000  50.32713  79   Male     Ex-smoker
## 8   ID00007637202177411956430    41 2064  51.93759  79   Male     Ex-smoker
## 9   ID00007637202177411956430    57 2057  51.76145  79   Male     Ex-smoker
## 63  ID00019637202178323708467    13 2100  92.85872  83 Female     Ex-smoker
## 64  ID00019637202178323708467    14 2047  90.51514  83 Female     Ex-smoker
## 65  ID00019637202178323708467    16 2001  88.48110  83 Female     Ex-smoker
## 66  ID00019637202178323708467    18 2054  90.82467  83 Female     Ex-smoker
## 67  ID00019637202178323708467    20 2002  88.52532  83 Female     Ex-smoker
## 68  ID00019637202178323708467    26 2116  93.56622  83 Female     Ex-smoker
## 69  ID00019637202178323708467    38 1946  86.04908  83 Female     Ex-smoker
## 70  ID00019637202178323708467    53 1808  79.94694  83 Female     Ex-smoker
## 71  ID00019637202178323708467    66 1778  78.62038  83 Female     Ex-smoker
## 72  ID00020637202178344345685    18 2297 117.77071  66 Female  Never smoked
## 73  ID00020637202178344345685    19 2145 109.97744  66 Female  Never smoked
## 74  ID00020637202178344345685    21 2245 115.10459  66 Female  Never smoked
## 75  ID00020637202178344345685    23 2250 115.36095  66 Female  Never smoked
## 76  ID00020637202178344345685    25 2096 107.46514  66 Female  Never smoked
## 77  ID00020637202178344345685    31 2115 108.43929  66 Female  Never smoked
## 78  ID00020637202178344345685    44 2058 105.51682  66 Female  Never smoked
## 79  ID00020637202178344345685    54 1940  99.46678  66 Female  Never smoked
## 80  ID00020637202178344345685    70 1861  95.41632  66 Female  Never smoked
## 81  ID00023637202179104603099    -3 1536  65.30612  71 Female     Ex-smoker
## 82  ID00023637202179104603099     3 1368  58.16327  71 Female     Ex-smoker
## 83  ID00023637202179104603099     5 1361  57.86565  71 Female     Ex-smoker
## 84  ID00023637202179104603099     7 1465  62.28741  71 Female     Ex-smoker
## 85  ID00023637202179104603099     9 1681  71.47109  71 Female     Ex-smoker
## 86  ID00023637202179104603099    15 1461  62.11735  71 Female     Ex-smoker
## 87  ID00023637202179104603099    27 1337  56.84524  71 Female     Ex-smoker
## 88  ID00023637202179104603099    39 1304  55.44218  71 Female     Ex-smoker
## 89  ID00023637202179104603099    55 1406  59.77891  71 Female     Ex-smoker
## 699 ID00184637202242062969203    29 3130  67.31183  52   Male     Ex-smoker
## 700 ID00184637202242062969203    30 2965  63.76344  52   Male     Ex-smoker
## 701 ID00184637202242062969203    32 3197  68.75269  52   Male     Ex-smoker
## 702 ID00184637202242062969203    34 2761  59.37634  52   Male     Ex-smoker
## 703 ID00184637202242062969203    37 3009  64.70968  52   Male     Ex-smoker
## 704 ID00184637202242062969203    43 3175  68.27957  52   Male     Ex-smoker
## 705 ID00184637202242062969203    55 3021  64.96774  52   Male     Ex-smoker
## 706 ID00184637202242062969203    67 2991  64.32258  52   Male     Ex-smoker
## 707 ID00184637202242062969203    83 2743  58.98925  52   Male     Ex-smoker


Bibliography

Kumar et al. (2012)

Mackin et al. (2015)

Kolossváry et al. (2018)

Mayerhoefer et al. (2020)

Timmeren et al. (2020)

Rizzo et al. (2018)

Hofmanninger et al. (2020)

Hofmanninger, J., Prayer, F., Pan, J., Röhrich, S., Prosch, H., & Langs, G. (2020). Automatic lung segmentation in routine imaging is primarily a data diversity problem, not a methodology problem. European Radiology Experimental, 4(1), 1–13.
Kolossváry, M., Kellermayer, M., Merkely, B., & Maurovich-Horvat, P. (2018). Cardiac computed tomography radiomics. Journal of Thoracic Imaging, 33(1), 26–34.
Kumar, V., Gu, Y., Basu, S., Berglund, A., Eschrich, S. A., Schabath, M. B., Forster, K., Aerts, H. J. W. L., Dekker, A., Fenstermacher, D., Goldgof, D. B., Hall, L. O., Lambin, P., Balagurunathan, Y., Gatenby, R. A., & Gillies, R. J. (2012). Radiomics: The process and the challenges. Magnetic Resonance Imaging, 30(9), 1234–1248. https://doi.org/https://doi.org/10.1016/j.mri.2012.06.010
Mackin, D., Fave, X., Zhang, L., Fried, D., Yang, J., Taylor, B., Rodriguez-Rivera, E., Dodge, C., Jones, A. K., & Court, L. (2015). Measuring CT scanner variability of radiomics features. Investigative Radiology, 50(11), 757.
Mayerhoefer, M. E., Materka, A., Langs, G., Häggström, I., Szczypiński, P., Gibbs, P., & Cook, G. (2020). Introduction to radiomics. Journal of Nuclear Medicine, 61(4), 488–495.
Rizzo, S., Botta, F., Raimondi, S., Origgi, D., Fanciullo, C., Morganti, A. G., & Bellomi, M. (2018). Radiomics: The facts and the challenges of image analysis. European Radiology Experimental, 2(1), 1–8.
Timmeren, J. E. van, Cester, D., Tanadini-Lang, S., Alkadhi, H., & Baessler, B. (2020). Radiomics in medical imaging—“how-to” guide and critical reflection. Insights into Imaging, 11(1), 1–16.